HDDS-14859. Use RocksDb secondary instance for validating volumes.#9947

Draft
ptlrs wants to merge 9 commits intoapache:masterfrom
ptlrs:HDDS-14859-Ignore-transient-errors-while-validating-RocksDb-on-volumes

Conversation

@ptlrs
Contributor

@ptlrs ptlrs commented Mar 18, 2026

What changes were proposed in this pull request?

In the volume scanner, we open the RocksDB that is present on each volume.
Opening this RocksDB in read-only mode can fail with transient errors.
The volume scanner should instead open the RocksDB as a secondary instance.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14859

How was this patch tested?

https://github.com/ptlrs/ozone/actions/runs/23264628727

@errose28
Contributor

Thanks for looking into this @ptlrs. I think we may want to use a secondary instance instead of a read-only instance for this check. It looks like it will meet the same goal of reading the CURRENT and MANIFEST files, but will only fail if the DB is truly in bad health. We will need to provide it a directory to write its own log files, but we can use the volume's tmp directory for this, which already holds other temporary files created during the disk check.

@ptlrs
Contributor Author

ptlrs commented Mar 18, 2026

Thanks for taking a look @errose28. Using a secondary instance had crossed my mind while writing this PR, but that page was the reason I didn't make the change.

In fact I saw this comment in a different discussion which says:

Read-only instances has only the state of the db when the read-only instance is opened. It's not meant to be used when the db is also opened by a normal DB instance. The normal db instance can cause file deletions unaware to the read-only instance. Future reads issued to the read-only instance may hit IOError due to trying to read from non-existing files.

So based on the documentation and the comment, it appears to me that unless you actually perform reads in RO mode, there won't be a problem.

We don't perform any reads during the volume check, but RocksDB does read the metadata from the footer of all SST files, as well as the MANIFEST files and the WAL, when it is opened in RO mode.

If that is where the problem is, then I am not sure whether using a secondary instance would solve our problem.

A secondary instance appears to be the same as an RO instance, with the extra capability of bringing the instance up to speed with the RW instance via a manual command invocation.
This is nice to have, but useful only if you actually read the data.

I don't mind changing to a secondary instance, since at worst we would get the same behavior, but it would be good to brainstorm what the differences could be.

@errose28
Contributor

@ptlrs the main documentation is unfortunately vague here, but there's also this excerpt from the FAQ:

Q: Can I write to RocksDB using multiple processes?

A: No. However, it can be opened using Secondary DB. If no write goes to the database, it can be opened in read-only mode from multiple processes.

Q: Does RocksDB support multi-process read access?

A: Yes, you can read it using secondary database using DB::OpenAsSecondary(). RocksDB can also support multi-process read only process without writing the database. This can be done by opening the database with DB::OpenForReadOnly() call.

In our case we are only using one process with multiple handles. However, since writes will be going to the DB as we are checking the volume, this seems to indicate that secondary instance is still what we want to use here.

@ptlrs
Contributor Author

ptlrs commented Mar 24, 2026

I have updated the PR to use a secondary RocksDB instance instead of a read-only instance.

@ptlrs ptlrs requested review from ChenSammi and yandrey321 March 26, 2026 18:43
@ptlrs
Contributor Author

ptlrs commented Mar 26, 2026

Hi @errose28 @ChenSammi @yandrey321, I have pushed some updates to this PR. Could you please take another look?

@ss77892
Contributor

ss77892 commented Mar 30, 2026

So, we know for sure that openReadOnly might fail due to some internal work performed by RocksDB (like if there was log rotation in the middle, it might fail with FNF exception). Does opening as a secondary suffer from the same problems? If not, do we really need a secondary check? If it is, then how do we know that there will be no such failure during the second check?

@ptlrs
Contributor Author

ptlrs commented Mar 31, 2026

@ss77892 you are right, we don't know whether a secondary instance will face the same fate. There is no documentation that clearly says a secondary instance behaves differently from an RO instance when it comes to opening a new instance.

The core contribution of this PR is attempting to open a db twice before declaring failure.

We can update the PR to make the choice between RO and secondary instance configurable if we think such a fallback would be helpful here.
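For illustration, the "open twice before declaring failure" idea can be sketched as a small generic retry helper in plain Java. The helper name and the simulated error below are invented for this sketch; they are not Ozone or RocksDB APIs.

```java
import java.util.concurrent.Callable;

public final class RetryOnce {
  // Attempt an operation up to maxAttempts times, treating any exception
  // from a non-final attempt as potentially transient. Only the final
  // failure propagates to the caller.
  static <T> T attempt(Callable<T> op, int maxAttempts) throws Exception {
    Exception last = null;
    for (int i = 0; i < maxAttempts; i++) {
      try {
        return op.call();
      } catch (Exception e) {
        last = e;  // remember and retry
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    // Simulate a transient failure: the first open throws, the second succeeds.
    int[] calls = {0};
    String result = attempt(() -> {
      if (calls[0]++ == 0) {
        throw new RuntimeException("transient MANIFEST rotation error");
      }
      return "db-opened";
    }, 2);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

The trade-off discussed later in this thread applies: each extra attempt also adds latency to the volume check, which matters when the check runs under a timeout.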

@errose28
Contributor

There is no documentation which clearly says that secondary instance behaves differently from RO instance when it comes to opening a new instance.

I think the excerpt from the FAQ I mentioned in this comment is sufficient to indicate that a secondary instance is expected to open cleanly while writes are ongoing, and a read-only instance is not. Whether the phrasing they've used is "clear" is debatable, and it would be nice if this were in the official doc page for the feature and not the FAQ. However, from my point of view this is sufficient to design around the assumption that a secondary instance is expected to open cleanly while writes are happening, unless the global DB files are corrupted or there is a transient IO error, which our sliding windows account for.
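As background, the sliding window mentioned above tolerates a bounded number of recent failed checks before failing the volume, so a single transient open error does not take a disk offline. A minimal pure-Java sketch of the idea (class and method names are invented; this is not the actual Ozone implementation):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Tracks the outcomes of the last `windowSize` checks; the volume is
// considered failed only if every check in the window failed.
public final class FailureWindow {
  private final int windowSize;
  private final Deque<Boolean> outcomes = new ArrayDeque<>();

  public FailureWindow(int windowSize) {
    this.windowSize = windowSize;
  }

  public void record(boolean passed) {
    outcomes.addLast(passed);
    if (outcomes.size() > windowSize) {
      outcomes.removeFirst();  // drop the oldest outcome
    }
  }

  public boolean volumeFailed() {
    // Require a full window of consecutive failures before failing the
    // volume, so an isolated transient error is absorbed.
    return outcomes.size() == windowSize && !outcomes.contains(true);
  }

  public static void main(String[] args) {
    FailureWindow w = new FailureWindow(3);
    w.record(false);                       // one transient failure
    w.record(true);                        // next check passes
    System.out.println(w.volumeFailed());  // false: window not all failures
    w.record(false);
    w.record(false);
    w.record(false);
    System.out.println(w.volumeFailed());  // true: last 3 checks all failed
  }
}
```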

@ss77892
Contributor

ss77892 commented Mar 31, 2026

However from my point of view this is sufficient to design this around the assumption that secondary is expected to open cleanly while writes are happening unless the global DB files are corrupted or there is a transient IO error, which our sliding windows account for.
Agree. Still, the purpose of this check itself is questionable. Correct me if I'm wrong, but all those open/openReadOnly calls read the base metadata and log files, and that wouldn't check the integrity and correctness of the SST files. So any issue at this level would just cause continuous health checks without any viable results.

@ptlrs
Contributor Author

ptlrs commented Mar 31, 2026

I think the excerpt from the FAQ I mentioned in this comment is sufficient to indicate that secondary is expected to open cleanly while writes are ongoing and read-only is not

@errose28 I would not infer "secondary is expected to open cleanly" based just on that FAQ.
The FAQ is more likely about reading the latest state of a key-value pair.
As per the RocksDB code, the race condition due to MANIFEST file rotation continues to affect secondary instances just like RO instances. This is a transient error which requires a retry attempt from the end user. There could be more such errors.

@ss77892 In RO/secondary mode, the SST files are opened and their metadata is checked. Each SST file's footer, metaindex block, and properties block are read. No data-related checks are performed, though. So the integrity and correctness of the SST files is partially verified, but not the main contents of the SST files.
The main contents of the SST files are indirectly verified when we attempt to read and verify the blocks and chunks in the data scanner.
The aim of this check is to quickly see if a disk is healthy or not. Being able to open the DB is one of the indicators.

@ptlrs
Contributor Author

ptlrs commented Apr 3, 2026

Hi @yandrey321 @ChenSammi @ss77892 @errose28, is there anything else we would like to update for this PR?

@ss77892
Contributor

ss77892 commented Apr 6, 2026

@ss77892 In RO/Secondary mode, the SST files are opened and their metadata is checked. Each SST file's footer, metaindex blocks, properties block are read. No data related checks are performed though. So integrity and correctness of the SST files is partially present but not for the main contents of the SST file.

It seems that this statement is not correct. I just traced an openReadOnly call with strace, and it doesn't touch the SST files at all:

[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/CURRENT", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/CURRENT", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/MANIFEST-001732", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db", O_RDONLY|O_DIRECTORY) = 15
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/IDENTITY", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/CURRENT", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/MANIFEST-001732", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db", O_RDONLY|O_DIRECTORY) = 15
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/IDENTITY", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001731.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001735.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001737.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001739.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001742.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001744.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001746.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001748.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001751.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001753.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001755.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001758.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001760.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001762.log", O_RDONLY|O_CLOEXEC) = 13
[pid 1992410] openat(AT_FDCWD, "/home/ssa/rocks/12/hdds/rocksdb-sim-cluster/DS-187f5526-f62d-43a6-8436-f8e7e625e6ea/container.db/001764.log", O_RDONLY|O_CLOEXEC) = 13

So opening it read-only looks quite useless, because it reads only the metadata and log files, and wastes CPU reading the log files into memory.

ptlrs added 3 commits April 18, 2026 11:03
…-14859-Ignore-transient-errors-while-validating-RocksDb-on-volumes

# Conflicts:
#	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
#	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/HddsVolume.java
@ptlrs ptlrs requested a review from yandrey321 April 18, 2026 19:10
@ptlrs
Contributor Author

ptlrs commented Apr 21, 2026

Hi @yandrey321 @ChenSammi @ss77892 @errose28, can I get a review for this PR?

@ChenSammi
Contributor

Besides opening RocksDB as a secondary instance, I'm not sure we need to blindly retry the DB open 2 times. We already have a tolerance of 2 IO check failures, which can cover transient errors too. And we have the 10 minute timeout for a volume check task; I feel this retry will increase the possibility of a task timeout when the DN volume is very busy.

@ptlrs ptlrs changed the title HDDS-14859. Ignore transient errors while validating RocksDb on volumes. HDDS-14859. Use RocksDb secondary instance for validating volumes. Apr 23, 2026
@ptlrs
Contributor Author

ptlrs commented Apr 23, 2026

@ChenSammi @yandrey321 I have removed the changes that retry opening the DB, based on @ChenSammi's suggestion and #9954.

@ptlrs ptlrs requested a review from yandrey321 April 23, 2026 23:41
)
private boolean isDiskCheckEnabled = true;

@Config(key = "hdds.datanode.rocksdb.disk.check.io.test.enabled",
Contributor


How about we rename it to "hdds.datanode.disk.check.rocksdb.io.test.enabled", so that all the disk check property will share the "hdds.datanode.disk.check" prefix?

Contributor Author


I have moved the switch to its own PR in #10149.

I have updated it there.

@ChenSammi
Contributor

ChenSammi commented Apr 24, 2026

@ptlrs, thanks for updating the patch. I'm sorry that I didn't seriously check the difference between OpenAsSecondary and OpenReadOnly in my last comment.

From this link https://github.com/facebook/rocksdb/wiki/rocksdb-faq, I cannot find an obvious advantage of OpenAsSecondary over OpenReadOnly in our current DB check case. And OpenAsSecondary is obviously more expensive than OpenReadOnly: since it has the capability to catch up with the normal read/write DB instance, it requires an extra directory to save its own data, which needs cleanup on each success or failure (which is not covered in this patch yet). If we don't clean up, and the DN runs for a long time, I'm not sure how much space it will consume.
And for a case where the disk is close to full but not yet full, could an OpenAsSecondary call fill the disk and fail the volume checker? Shall we mark the disk check as failed in that case? Can we easily tell whether the disk is full because of OpenAsSecondary or because of other DN operations?

I believe disk check operations should be simple and quick, and should avoid changing the disk state themselves as much as possible.
@errose28, could you reconsider your suggestion about OpenAsSecondary?

@ptlrs
Contributor Author

ptlrs commented Apr 29, 2026

@ChenSammi

  1. We use secondary instance mode because read-only mode is not supposed to be used if the DB has already been opened in read-write mode. Read-only mode can have undefined behaviour:

if a DB is simultaneously opened by Open and OpenForReadOnly, the read-only instance has undefined behavior (though can often succeed if quickly closed)

  2. Secondary instances won't duplicate any files in the primary DB's directory:

The Read-only/Secondary instances opens all files in the primary DB Path in read-only mode and never creates any new files in that directory

rpatel@GQ2V5HK6NX disk-check % pwd
/Users/rpatel/Github/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-ad36e20c-8035-4944-a3f3-0c81241cd28f/ozone-metadata/datanode-1/data-0/hdds/ad36e20c-8035-4944-a3f3-0c81241cd28f/tmp/disk-check
rpatel@GQ2V5HK6NX disk-check % ls -alht
total 24
-rw-r--r--@ 1 rpatel  staff   9.5K Apr 27 20:21 LOG
drwxr-xr-x@ 3 rpatel  staff    96B Apr 27 20:21 .
drwxr-xr-x@ 4 rpatel  staff   128B Apr 27 20:21 ..

  3. TryCatchUpWithPrimary is a user-triggered operation. It never runs automatically. When it catches up with new data, it only rebuilds the memtable and does not write any new data. Since we don't read any keys from the secondary instance, we are not concerned with this functionality and will not call this method.
  4. A secondary instance requires an extra directory, but that directory is only for saving RocksDB log files. The log files are rotated each time a secondary instance is opened. It does not store any important data.
  5. I have updated the PR to remove these log files every time after closing the secondary instance.
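Since the secondary instance only writes info-log files (`LOG` and rotated `LOG.old.*` files) into the directory we hand it, the cleanup described above could look roughly like the following pure-JDK sketch. The glob pattern and method name are assumptions for illustration, not the PR's actual code:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public final class SecondaryLogCleanup {
  // Delete the RocksDB info-log files (LOG, LOG.old.*) that a secondary
  // instance leaves behind in its log directory once it has been closed.
  static int deleteRocksDbLogs(Path logDir) throws IOException {
    int deleted = 0;
    try (DirectoryStream<Path> files = Files.newDirectoryStream(logDir, "LOG*")) {
      for (Path file : files) {
        Files.delete(file);
        deleted++;
      }
    }
    return deleted;
  }

  public static void main(String[] args) throws IOException {
    // Simulate what the tmp/disk-check directory looks like after a check.
    Path dir = Files.createTempDirectory("disk-check");
    Files.createFile(dir.resolve("LOG"));
    Files.createFile(dir.resolve("LOG.old.1745700000"));
    System.out.println(deleteRocksDbLogs(dir) + " log files removed");
  }
}
```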

Contributor

@errose28 errose28 left a comment


Thanks for working on this. The DB checks are called after an initial available space check, but should we also double check the available space in the DB health check if the open fails, since it will be writing a LOG file?
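A cheap way to make that double check, and to distinguish a disk-full failure from corruption, is to consult the filesystem's free space around the open attempt. A pure-JDK sketch; the method name and threshold parameter are invented for illustration:

```java
import java.io.File;

public final class SpaceCheck {
  // Returns true if the volume still has at least minFreeBytes available.
  // If a secondary open fails while this returns false, the failure can be
  // attributed to lack of space rather than DB corruption.
  static boolean hasFreeSpace(File volumeRoot, long minFreeBytes) {
    return volumeRoot.getUsableSpace() >= minFreeBytes;
  }

  public static void main(String[] args) {
    File root = new File(".");
    // With a 0-byte threshold this should pass on any mounted filesystem.
    System.out.println(hasFreeSpace(root, 0L));
  }
}
```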

  try (ManagedOptions managedOptions = new ManagedOptions();
-      ManagedRocksDB ignored = ManagedRocksDB.openReadOnly(managedOptions, dbFile.toString())) {
+      ManagedRocksDB ignored =
+          ManagedRocksDB.openAsSecondary(managedOptions, dbFile.toString(), getTmpDir().getPath())) {
Contributor


In this comment it looks like the logs are being written to the disk check dir, but it seems like the code doesn't match.

Suggested change
- ManagedRocksDB.openAsSecondary(managedOptions, dbFile.toString(), getTmpDir().getPath())) {
+ ManagedRocksDB.openAsSecondary(managedOptions, dbFile.toString(), getDiskCheckDir().getPath())) {

Contributor


I have updated the PR to remove these log files every time after closing the secondary instance

I'm also not seeing this. Are there still some commits pending?

);
}

public static ManagedRocksDB openAsSecondary(
Contributor


@ptlrs thanks for all the research you've done on secondary instances. It would be great to put a summary as a javadoc above this method about how they work and what we can expect vs. readonly instances.

}

@VisibleForTesting
public VolumeCheckResult checkDbHealth(File dbFile) throws InterruptedException {
Contributor


Let's add javadoc to this method to clarify that our goal is to check global files like CURRENT and MANIFEST, and that verifying the contents of all SST files/values in the DB is done by the container data scanner.
